TaskEval: Synthesised Evaluation for Foundation-Model Tasks

Widanapathiranage, Dilani, Barnett, Scott, Kurniawan, Stefanus, Takerngsaksiri, Wannita

arXiv.org Artificial Intelligence

Hallucinations are a key concern when creating applications that rely on Foundation Models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as evals. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when there is no metric or dataset. The demand for both automated approaches and deep integration of human insight makes this a challenging problem. We address this gap by proposing an approach to synthesise an FM task-specific evaluator program that provides automation and a custom UI for capturing feedback. The core novelty of our approach lies in: (1) a task-agnostic meta-model that captures properties of any FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval synthesiser that selects or generates an appropriate set of evals. We implement our approach in TaskEval and demonstrate the concept on two diverse FM tasks: chart data extraction and document question answering. A preliminary evaluation of the quality of our selected evals shows 93% and 90% accuracy respectively. Our research tackles a growing problem facing engineering teams: how to evaluate and review outputs from FM tasks.


Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

Gandhi, Kahaan, Bolliet, Boris, Zubeldia, Inigo

arXiv.org Artificial Intelligence

We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
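The pass@1 figures quoted above have a simple mechanical meaning; as a minimal sketch (the function name and data are illustrative, not from the paper), pass@1 over a benchmark is the fraction of tasks whose first attempt passes verification:

```python
def pass_at_1(results: list[bool]) -> float:
    """results[i] is True if the first attempt on task i succeeded."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# e.g. 8 of 10 benchmark tasks solved on the first try -> 0.8
score = pass_at_1([True] * 8 + [False] * 2)
print(score)  # 0.8
```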


Anomaly Detection for IoT Global Connectivity

Iglesias, Jesus Omaña, Perales, Carlos Segura, Geißler, Stefan, Perino, Diego, Lutu, Andra

arXiv.org Artificial Intelligence

Internet of Things (IoT) application providers rely on Mobile Network Operators (MNOs) and roaming infrastructures to deliver their services globally. In this complex ecosystem, where the end-to-end communication path traverses multiple entities, it has become increasingly challenging to guarantee communication availability and reliability. Further, most platform operators use a reactive approach to communication issues, responding to user complaints only after incidents have become severe, compromising service quality. This paper presents our experience in the design and deployment of ANCHOR -- an unsupervised anomaly detection solution for the IoT connectivity service of a large global roaming platform. ANCHOR assists engineers by filtering vast amounts of data to identify potentially problematic clients (i.e., those with connectivity issues affecting several of their IoT devices), enabling proactive issue resolution before the service is critically impacted. We first describe the IoT service, infrastructure, and network visibility of the IoT connectivity provider we operate. Second, we describe the main challenges and operational requirements for designing an unsupervised anomaly detection solution on this platform. Following these guidelines, we propose statistical rules as well as machine- and deep-learning models for anomaly detection in IoT verticals, based on passive signaling traffic. We describe the steps we followed working with the operational teams on the design and evaluation of our solution, and report an evaluation on operational IoT customers.
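As a minimal illustration of the kind of statistical rule the abstract mentions (the function, data, and threshold here are hypothetical, not ANCHOR's actual implementation), a z-score test over a client's per-interval signaling counts flags intervals that deviate sharply from that client's baseline:

```python
import statistics

def zscore_anomalies(series: list[float], threshold: float = 2.5) -> list[int]:
    """Flag indices whose value deviates more than `threshold`
    population standard deviations from the series mean."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return []
    return [i for i, v in enumerate(series) if abs(v - mean) / std > threshold]

# A client whose hourly signaling-failure count suddenly spikes is flagged:
counts = [12, 10, 11, 13, 12, 95, 11, 12]
print(zscore_anomalies(counts))  # [5] -- the spike
```

In practice a rule like this would run per client and per signaling metric, feeding engineers a ranked shortlist instead of raw traffic.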


A taxonomy of explanations to support Explainability-by-Design

Tsakalakis, Niko, Stalla-Bourdillon, Sophie, Huynh, Trung Dong, Moreau, Luc

arXiv.org Artificial Intelligence

As automated decision-making solutions are increasingly applied to all aspects of everyday life, capabilities to generate meaningful explanations for a variety of stakeholders (e.g., decision-makers, recipients of decisions, auditors, regulators) become crucial. In this paper, we present a taxonomy of explanations that was developed as part of a holistic 'Explainability-by-Design' approach for the purposes of the project PLEAD. The taxonomy was built with a view to producing explanations for a wide range of requirements stemming from a variety of regulatory frameworks or policies set at the organizational level, either to translate high-level compliance requirements or to meet business needs. The taxonomy comprises nine dimensions. It is used as a stand-alone classifier of explanations conceived as detective controls, in order to aid supportive automated compliance strategies. A machine-readable format of the taxonomy is provided in the form of a light ontology, and the benefits of starting the Explainability-by-Design journey with such a taxonomy are demonstrated through a series of examples.


Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals

Nghiem, Khanh, Nguyen, Anh Minh, Bui, Nghi D. Q.

arXiv.org Artificial Intelligence

AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. Academic researchers and industry practitioners lack well-defined frameworks for positioning and evaluating emerging AI coding assistants in traditional programming paradigms [11] [2], while users lack clear stages of developing AI4SE tools that consistently produce high-quality results for specific coding tasks [5] [3] [4]. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.


Why embracing complexity is the real challenge in software today

MIT Technology Review

The reason we can't just wish away or "fix" complexity is that every solution--whether it's a technology or methodology--redistributes complexity in some way. When microservices emerged (a software architecture approach where an application or system is composed of many smaller parts), they seemingly solved many of the maintenance and development challenges posed by monolithic architectures (where the application is one single interlocking system). However, in doing so microservices placed new demands on engineering teams; they require greater maturity in terms of practices and processes. This is one of the reasons why we cautioned people against what we call "microservice envy" in a 2018 edition of the Technology Radar, with CTO Rebecca Parsons writing that microservices would never be recommended for adoption on Technology Radar because "not all organizations are microservices-ready." We noticed there was a tendency to look to adopt microservices simply because it was fashionable.


Engineering Manager (AI & ML) at Cleo AI Ltd - London

#artificialintelligence

If there's anything we can do to accommodate your specific situation, please let us know. Most people come to Cleo to do work that matters. Every day, we fight for the world's financial health, building a beloved AI that empowers people to make better financial decisions. Backed by some of the most well-known investors in tech, we've reached over 4 million users and plan to double that number each year...which is where you come in. We are doubling the engineering team in 2023, and need more technical leaders to future-proof our team!


Generative AI: Unlocking the future of fashion

#artificialintelligence

As this season's fashion weeks wrap up in London, Milan, New York, and Paris, brands are working to produce and sell the designs they've just showcased on runways--and they're starting next season's collections. In the future, it's entirely possible that those designs will blend the prowess of a creative director with the power of generative artificial intelligence (AI), helping to bring clothes and accessories to market faster, selling them more efficiently, and improving the customer experience. By now, you've likely heard of OpenAI's ChatGPT, the AI chatbot that became an overnight sensation and sparked a digital race to build and release competitors. ChatGPT is only one consumer-friendly example of generative AI, a technology comprising algorithms that can be used to create new content, including audio, code, images, text, simulations, and videos. Rather than simply identifying and classifying information, generative AI creates new information by leveraging foundation models, which are deep learning models that can handle multiple complex tasks at the same time.


Model Rollbacks Through Versioning

#artificialintelligence

There's general consensus in the Machine Learning community that models can and have made biased decisions against traditionally marginalized groups. Ethical AI researchers from Dr. Cathy O'Neil to Dr. Joy Buolamwini have gone to great lengths to establish a pattern of faulty decision making rooted in biased and unrepresentative data that result in serious harms. Unfortunately, our "intelligent" learning algorithms are only as smart, capable and ethical as we make them and we are only at the beginning of understanding the long term effects of biased models. Fortunately, there are many strategies already at our disposal that we can use to mitigate harms when they arise. Today, we will focus on a very powerful strategy: Model Rollbacks through Versioning.
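A minimal sketch of the rollback-through-versioning strategy, assuming a simple in-memory registry (the class and method names are illustrative, not a real library API): every deployed model version is recorded immutably, so serving can be pointed back at a known-good version the moment harmful behaviour is detected.

```python
class ModelRegistry:
    """Toy registry: immutable version history plus a pointer to production."""

    def __init__(self):
        self._versions = []   # append-only history of registered models
        self._active = None   # index of the version currently in production

    def register(self, model) -> int:
        """Record a new model version; returns its version number."""
        self._versions.append(model)
        return len(self._versions) - 1

    def promote(self, version: int) -> None:
        """Put a registered version into production."""
        if not 0 <= version < len(self._versions):
            raise ValueError("unknown version")
        self._active = version

    def rollback(self, to_version: int) -> None:
        """Point production back at a previously vetted version."""
        self.promote(to_version)

    def active(self):
        return self._versions[self._active]

registry = ModelRegistry()
v0 = registry.register("model-2023-01")  # known-good baseline
v1 = registry.register("model-2023-02")  # new model later found to be biased
registry.promote(v1)
registry.rollback(v0)                    # serve the vetted baseline again
print(registry.active())  # model-2023-01
```

Production systems would pair this with audit logs and monitoring that triggers the rollback, but the core idea is the same: never overwrite a model in place, so a safe version is always one pointer-change away.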


Tapad - Lead Data Scientist (Remote) at Experian - New York City, United States

#artificialintelligence

Position is open to Remote candidates in the US or Canada. Must be able to work starting around 9am EST to connect with teammates in Oslo, Norway. Founded in 2010, Tapad cracked the code on cross-device marketing technology. Our groundbreaking, proprietary technology assimilates trillions of data points to find the relationship between smartphones, desktops, laptops, tablets, and connected TVs. Ten years later, we are processing data at petabyte scale, with an engineering team that comprises roughly half of our entire organization.